The
human ability to vocalize sound, as well as the
soundmaking abilities of other species, is amazingly complex
and endlessly fascinating. Speech, song, and non-verbal
behaviour are central to all aspects of individual and cultural
development, and therefore play an extremely important role
in acoustic communication.
In many cases, when we turn to the study of acoustics and
psychoacoustics, we can find strong evidence that vocal
soundmaking establishes many of the norms for understanding
sound in general, particularly the way in which spectral and
temporal patterns are produced and processed by the auditory
system.
We can only summarize here the most basic information in this
extensive field, as subdivided into these categories.
A) The acoustics of speech
production: vowels and consonants
B) Linguistic descriptions of vowels and
consonants
C) Reading a sonogram
D) Voice and soundmaking on a personal and
interpersonal level; paralanguage
E) Soundmaking in cultural contexts
F) Cross-cultural forms of vocal
soundmaking
Q) Review Quiz
A. The acoustics of speech production.
In the simplest acoustic model of the voice, shown below,
an airstream originating in the lungs and supported by the
diaphragm is the power source; it may be periodically
modulated by the vocal folds
(also known as the vocal cords – note the spelling), and is
then filtered by the vocal tract acting as a
variable resonator. This model is often called the source-filter
model and can be simulated with a periodic impulse
train, whose spectrum is a set of harmonics of
decreasing amplitude, passed through a filter that models
the vocal tract to create a set of resonant
frequencies called formants.
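To make this concrete, here is a minimal sketch of the source-filter model in Python (assuming the numpy and scipy libraries). The glottal source is approximated by a periodic impulse train and the vocal tract by a cascade of two-pole resonators; the formant frequencies and bandwidths are illustrative values only, roughly corresponding to an adult male "aah".

```python
import numpy as np
from scipy.signal import lfilter

sr = 16000                 # sample rate (Hz)
f0 = 120                   # glottal pulse rate (Hz), a typical male speaking pitch
dur = 1.0                  # seconds

# Source: a periodic impulse train. (An ideal impulse train has harmonics of
# equal amplitude; a real glottal pulse train rolls off towards high frequencies.)
n = int(sr * dur)
source = np.zeros(n)
source[::int(sr / f0)] = 1.0

def formant_resonator(x, freq, bw, sr):
    """Two-pole resonator centred on freq (Hz) with bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / sr)
    theta = 2 * np.pi * freq / sr
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    b = [1.0 - r]          # rough gain normalization
    return lfilter(b, a, x)

# Filter: cascade of three formant resonators (illustrative "aah"-like values)
voiced = source
for freq, bw in [(730, 90), (1090, 110), (2440, 160)]:
    voiced = formant_resonator(voiced, freq, bw, sr)

voiced /= np.abs(voiced).max()   # normalize before writing or playback
```

Changing the three formant frequencies while keeping f0 fixed changes the vowel; changing f0 while keeping the formants fixed changes the pitch but not the vowel, which is exactly the independence described below.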
The vocal folds open and close in
response to the pressure buildup. Their front end is
attached to the thyroid cartilage (or Adam’s apple), as
shown above, so the movement resembles what
is shown in the next diagram. The change in air pressure
with each opening and closing is called a glottal pulse,
the glottis being the opening between the folds, and when
there is a periodic train of such pulses, the sound
is said to be voiced, meaning that it will have a pitch
depending on the periodic frequency. If you feel the area of
your throat around the Adam’s apple, you can feel these
vibrations when you make a voiced sound such as a vowel.
Vocal folds
closing and opening (source: Denes & Pinson)
The vocal tract includes the larynx, the pharynx
above it, the mouth and optionally the very large nasal
cavity, all of which act as a variable resonator.
The main determinant of the frequency response of this
resonator is the tongue position, and this is how
different vowels are created. In the following sound
example, we first hear a single glottal pulse, followed by
that single pulse put through three different filters, and
with such a minimal model, it may be surprising that you can
already hear different vowels. This is because the spectral
envelope with its characteristic peaks is already activating
a spatial pattern of excitation along the basilar membrane,
as discussed in the second Vibration
module. Then we hear a periodic set of those pulses
(rather harsh but still a harmonic spectrum), followed by
putting them through the three filters again to produce the
vowels "ahh", "eeh", and "ooo".
Glottal
pulse, then heard through 3 filters; glottal pulse
train, then through the same filters
(Source: Cook 5)
An
alternative demonstration is a simple transition from a
closed mouth hum (which emphasizes the harmonics produced by
the vocal folds) to an open mouth, sustained vowel ”aah” in
this video version of
the example. Note how the waveform shown in the oscilloscope
becomes more complex after the transition, the spectrum gets
richer with the formant regions, and the sound gets louder.
A key characteristic of this system is that
the vibration of the vocal folds, producing a pitched
sound in normal speech, is largely independent of
the vocal tract resonances. In other words, higher or lower
pitches will sound like the same vowel when the tongue stays
fixed and therefore the resonances are also fixed. When you
sound the vowel “aah” at different pitches, for instance,
you will notice that your vocal tract (and tongue) stays in
one configuration. However, with the singing voice, it is
difficult to produce certain vowels on a high pitch.
Since formant frequencies are fixed for a given vowel,
a vowel cannot be transposed in pitch by more than about
plus or minus 10% (which is less than a whole tone in
music) and still remain recognizable. On the other hand, the
speaking pitch range can easily vary over an octave or more.
The following example transposes speech by a 3:1 ratio, at
which point the sound of the voice is often referred to as
the “chipmunk effect” (because of a popular music
phenomenon in the late 1950s known as Alvin and the
Chipmunks which used this effect with humorous intent).
However, an acceptable degree of realism is restored by
using a phase vocoder technique to scale the
frequency and time domains independently. After the original
version in this example, it is transposed up by a factor of
three. Then, in the third part of the example, the pitch is
also raised by the 3:1 ratio, but the original formant shape
is kept intact, and comprehension is improved.
Original speech example, then transposed by 3:1 without and with formant correction
Source: Pierce 48 & 53
Spectrogram of second and third parts
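For readers who want to try this themselves, here is a minimal sketch (assuming the librosa and soundfile libraries, and a hypothetical input file speech.wav). Plain resampling reproduces the chipmunk effect, shifting pitch, formants and duration together; the phase-vocoder pitch shift keeps the duration intact, but full formant correction of the kind heard in the third part of the example would additionally require estimating the spectral envelope (e.g. by LPC or cepstral smoothing) and re-applying it after the shift, which is beyond this sketch.

```python
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None)       # hypothetical input file

# 1. "Chipmunk" version: keep one sample in three, then play at the original
#    rate. Pitch, formants and duration all scale together by 3:1.
chipmunk = librosa.resample(y, orig_sr=sr, target_sr=sr // 3)
sf.write("chipmunk.wav", chipmunk, sr)

# 2. Phase-vocoder pitch shift: raise the pitch 3:1 (about 19 semitones) while
#    keeping the original duration. The formants still shift here; a separate
#    spectral-envelope correction would be needed to keep them in place.
n_steps = 12 * np.log2(3)
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
sf.write("shifted.wav", shifted, sr)
```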
Formant frequencies. What
determines the frequencies of the formants? First of all,
keep in mind that a formant is a narrow, resonant band of
frequencies, so when we refer to the formant frequency,
we are pointing to the centre frequency of that
band.
In order to understand how formants are placed, we can
recall the basic modes of vibration in a tube closed
at one end, and open at the other, as documented in the
first Vibration module. If
we imagine the vocal tract as such a tube (which would make
us look very odd!), then the resonant frequencies would be
the set of odd harmonics, because the fundamental
mode of vibration corresponds to 1/4 wavelength, and the
other modes to its odd multiples, as shown in the diagram
below.
There are theoretically an infinite number of those resonant
modes, but we will show just the first four, as the strength
of the higher ones falls off quickly. However, the actual
shape of the vocal tract is irregular, hence the broadening
of the resonant energy into the formant shape, but to start
from first principles, let’s return to the tube closed at
one end (the vocal folds) and open at the other (the mouth).
If the average male vocal tract is about 7” (17.5 cm) long,
and the fundamental mode is 1/4 wavelength, then a full
wavelength is about 28” (70 cm) which corresponds to
approximately 500 Hz. The odd harmonics, then, are
1500, 2500 and 3500 Hz, as noted in the diagram. The female
vocal tract, being shorter, is likely to have formant
frequencies 10-20% higher than that.
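These figures can be checked directly from the quarter-wavelength relationship for a tube closed at one end, f_n = (2n - 1) c / 4L. A minimal sketch, assuming a speed of sound of about 343 m/s:

```python
c = 343.0    # speed of sound in air (m/s, approx. room temperature)
L = 0.175    # vocal tract length (m), the average adult male figure used above

# A tube closed at one end resonates at odd multiples of its quarter-wave mode
for n in range(1, 5):
    f = (2 * n - 1) * c / (4 * L)
    print(f"mode {n}: {f:.0f} Hz")
# prints roughly 490, 1470, 2450 and 3430 Hz, i.e. the ~500/1500/2500/3500 Hz
# values quoted above; a 10-20% shorter tract scales all of these up accordingly
```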
The diagram also notes the positions of minimum pressure,
the nodes (marked N), and the maximum pressure
positions, the antinodes (marked A). The mouth is a
nodal position for the first formant, and the diagrams for
the other formants show the approximate position of the
nodes inside the vocal tract. The rule is that as the
area of the vocal tract increases at a nodal position, the
formant frequency rises as well; conversely, if it
decreases, the formant frequency falls. The inverse
relation applies when the area changes at an antinode
position.
The following table lists the first three
formant
(i.e. centre) frequencies F1, F2 and F3, (the most essential
ones to identify a vowel) for the adult male and female, as
well as for children. A simple example of the area of the
vocal tract changing is with the open mouth vowel “aah”
where the first male formant has risen to 570 Hz, and the
closed mouth “eeh” where it has fallen to 270 Hz.
Table of formant frequencies for
men, women and children
Note that for these typical measurements, the vowel has been
placed in a “consonantal environment”, namely a soft attack
“h” and a hard consonant ending “d” in order to standardize
the formant positions. In a continuous speech stream, the
tongue may need to move to a new position right after the
vowel (for instance, to a closed mouth “n”) which will cause
the vowel formants to shift. Therefore, this table
represents a static situation.
Perry Cook has programmed a vocal
simulator based on the
source-filter model, and in this next example he has
modelled several tongue positions and their respective
formants. However, only the first three are recognizably
correct and plausible, and the last two are “unreasonable”
because they can’t be physically produced by any tongue
position. You’ll probably find them quite funny, as you try
to imagine someone contorting their mouth into such a weird
shape!
Five
vocal tract simulations, the last two
"unreasonable"
Source: Cook 39
The Singing Formant. The Swedish
speech acoustician Johan Sundberg researched the scientific
principles of the singing voice from the 1970s until his
retirement in 2001, work that culminated in his book The
Science of the Singing Voice.
Here we will only outline one major finding, with the
recommendation of reading his work further and viewing some
of the lectures that are available online.
In particular Sundberg is associated with identifying the singing
formant. This refers to a vocal technique that is
developed mainly in opera singers to project their voice
over an orchestra. The difference in that type of voice is
the presence of an added (or perhaps “amplified”) resonance
region in the 2-3 kHz range. Given our sensitivity
to those frequencies, this added spectral component allows
the unamplified vocal sound to be heard above a full
orchestra. Here is an example comparing this type of trained
voice to a more conventional one, using vocal synthesis.
Singer
with and without the singing formant; synthesized voice
Source: Cook 42
The
historic reason for this development in the Western operatic
tradition is an interesting confluence of acoustic and
cultural influences. In the second half of the 19th century,
opera houses with performances for a larger paying audience
had become the norm, and at the same time, many of the
orchestral instruments had been re-designed to be louder in
order for their sound to fill the hall. Also, in many cases,
the actual number of instruments in the orchestra had
increased, sometimes to over 100.
Opera singers prior to that period generally had lighter and
more agile voices, culminating in the “bel canto” repertoire
of the first half of the 19th century. However, it was very
difficult to make this style of singing louder without
losing pitch accuracy and the comprehension of the text. The
“solution” was the type of vocal enhancement described by
the singing formant, and the result was a new type of singer
that is now associated with “grand opera”.
The irony in the 20th century is that amplification could
have been used with the earlier types of voices, but the
prejudice against using that solution remains today (as
being too much associated with musical theatre). However,
historically informed performances of works prior to the
19th century have become more widely available, and singers
who specialize in that repertoire do not necessarily have to
resort to the singing formant approach.
Diphthongs. The above examples are
called the pure vowels because they have a fixed
tongue position and fixed set of formant frequencies.
However, the tongue and mouth can move during a vowel and
create sliding formants. These are called diphthongs,
a word that is hard to say and even harder to spell. We will list
the possible ones in English in the next section, but for
now let’s look at an extreme example, namely going from an
“aah” to an “eee”.
This diphthong seems extreme because we go from an open
mouth to a nearly closed one, but more dramatic is the
raising of the tongue towards the top of the mouth which
reduces the area at the antinode (A) between the two nodal
positions shown above. This results in a huge rise of the
second formant from 840 Hz to 2290 Hz (see the table under
the male “haw’d” and “heed” and check the position of the
antinode for the second formant in the diagram above it).
Also, try doing this yourself.
You can watch the transition in the video example, and
notice the huge gap in the middle of the spectrum with the
“eee”. The partial closing of the mouth also makes the sound
weaker, but in general we think of “eee” as a bright
vowel because of the two high formants (2nd and 3rd).
Finally you can compare the two waveforms of the
vowels in a steady state, as below. Because of the shape of
the glottal pulse, vowel waveforms are not symmetrical
above and below the zero axis, as in a sine wave or other
free vibration. If a simulation lacked that character, it
would not sound realistic.
Video
example of an
extended diphthong
Waveform of the vowel "aah"
Waveform of the vowel "eee"
Diphthong spectrum: aah to eee
Consonants. Thus far we have been
exploring the source-filter model for vocal production, and
it can also be used for the production of consonants.
However, what it will produce is a spectral analysis
of the noise bands associated with the consonants in terms
of the airstream moving through the vocal apparatus. What it
omits is their temporal envelope which is arguably
more important for their identification.
Consonants, like speech in general, can be voiced –
that is, with the vibration of the vocal folds and hence a
pitch – or unvoiced, without those vibrations and
pitch (also called voiceless). Whispering is an
example of unvoiced speech, and yet we can still understand
it. We even think we can hear pitch rises and falls simply
from the cues provided by the vocal tract resonances.
In Hildegard Westerkamp’s 1975 work Whisper Study,
she whispers a sentence from Kirpal Singh quite close to the
mike. Despite the absence of any voicing of the vowels, the
text remains clear and the consonants are prominent. Compare
the spectrum for the whispered text with her speaking it
normally. Besides the consonants, the upper formants of the
vowels still retain their character.
(top) Spectrum of whispered text
(bottom) Spectrum of spoken text
Westerkamp whispered text
Consonants
can be voiced or unvoiced, and as we will document in the next
section, they are classified by linguists according to the place
and manner of their articulation. Here we will simply
present the source-filter model using noise as a source (the
first sound) and four filter settings that produce differing
bandwidths and distributions of the noise, simulating the
unvoiced consonants "fff", "sss", "shh" and "xxx".
Noise
band put through 4 filter settings to simulate
consonants
Source: Cook 30
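The source-filter sketch introduced earlier can be adapted to this unvoiced case by replacing the impulse train with white noise and using a single band-pass filter per consonant; the band edges below are illustrative guesses, not measured values from Cook's example.

```python
import numpy as np
from scipy.signal import butter, sosfilt

sr = 16000
noise = np.random.randn(sr)          # one second of white noise as the source

# Illustrative band-pass settings only; real fricatives also differ in level
# and in the fine shape of their spectra
bands = {"fff": (1500, 7500), "sss": (4000, 7500), "shh": (2000, 5000)}

consonants = {}
for name, (lo, hi) in bands.items():
    sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
    consonants[name] = sosfilt(sos, noise)
```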
B. Linguistic descriptions of vowels and
consonants. The basic units of a spoken language are
the phonemes
which are specific to an individual language, and
generally classified as vowels and consonants. English,
for instance, has about 44 phonemes. Linguists categorize
vowels according to the tongue position, and
sometimes display a graph of the first formant mapped
against the second. This implies that only two
formants are needed to identify a vowel, which is
fine for linguistic purposes, but in terms of a
re-synthesis of the voice, the third formant (at
least) should be included for realism, as listed in the
above table.
The small oval diagram at the left below shows the
position of the tongue within the mouth (which is placed
at the far left) in its extremes of sliding towards the
back of the mouth along the hard palate at the top, then
to the soft palate at the back, as shown in the upper
curve of the diagram. The lower curve follows the tongue
sliding back along the bottom of the mouth. You can try
doing this consciously, but it feels tricky to do smoothly
as it’s not a normal type of tongue movement in the mouth.
Instead of the oval mapping of the mouth, linguists prefer
to show the vowels in the vowel quadrilateral as
shown at the right. The extreme positions that are
numbered are called the cardinal vowels (which
were also shown in the oval diagram). These represent
theoretical tongue positions that are not necessarily used
in any language, but a linguist can be trained to produce
them.
Vowel quadrilateral (left) and linguistic
representation of vowels and diphthongs (right)
Source: Denes & Pinson
The solid circles mark the tongue position for the pure
vowels we have been referencing in the first
section, since they involve a fixed tongue position in the
mouth. Again, you will see the “ee” vowel towards the
front of the mouth on the upper left side and the “aah”
vowel towards the back and bottom of the mouth on the
lower side of the diagram. The bright vowels are
usually at the front of the mouth, and the dark
ones towards the back.
Notice that the “hesitation” vowels (“er” and “uh”)
have the tongue placed near the centre of the mouth –
ready to move once you decide what you want to say! The
formants for those vowels will likely be the closest to
the odd harmonic spacing we saw in the modes of vibration
model above.
Linguists use phonetic spellings of the vowels,
and to help us understand them, usually provide a common
word to help us out (e.g. the I for the short
vowel “i” as in “hit”), as shown in the table at the
right.
Likewise we can see the five diphthongs that are
used in English in the table. If you practice each one
slowly you can feel how your mouth and tongue move to
create them. For instance, “ou” and “au” as in “tone” and
“shout”, involve a rounding of the mouth. However the long
“a”, labelled as “ei” as in “take”, and the long “i”,
labelled as “ai” as in “might”, are usually performed so
quickly that you might not think of them as diphthongs,
but try them slowly. Similarly with the "oi" as in "toil".
Try saying the word I (referring to yourself) very slowly
and you can feel yourself doing something similar to the
diphthong shown in the previous video (“aah” going to
“eee”). In practice, you might do this when you are
hesitating on the word I, when you don’t know which verb
you are going to use next, but in colloquial speech this
gets done very fast, and in fact the mouth just moves
towards the “eee” part of the sound without spending any
time there. So the brain's pattern recognition mechanism
just reacts to this suggested movement of the mouth
without you actually needing to hold the “eee” portion.
Different accents in spoken English around the world will
likely involve slightly different tongue positions from
those indicated on the vowel quadrilateral, and of
course native speakers have practiced the appropriate
musculature movement since childhood. Similarly, some
dialects introduce diphthongs into what would
normally be pure vowels, as in the famous “Southern
drawl”.
Learning a new language as an adult is notoriously
difficult, because of the learned musculature habits that
get in the way, among other issues. However, here’s an
example of how this can be re-learned. The French vowel
as in rue (a street, not the English word to
regret) is articulated at the front of the mouth, whereas
in English its equivalent is at the back. So if you don’t
want to sound like a dumb foreigner in France, try this
exercise. Use your learned mouth position for the vowel
“eee” and then round your lips, and this will get you in
the right ballpark – or perhaps the right street!
The consonants are classified
by linguists by their place and manner of articulation,
as shown in this table. The “place” in question is the
position of the tongue, going backwards from the
lips “labial”, to the teeth “dental” (and an intermediate
position in between), then the gums “alveolar”, the top of
the mouth “palate”, the soft palate at the back “velar”,
and at the very back of the mouth “glottal”. As mentioned
earlier, consonants can be voiced or unvoiced (as
marked by “voi” and “unv” respectively in this table).
Consonants organized
according to their place and manner of articulation
The manner
of articulation is key to the temporal pattern
of the consonant, and hence its recognition. The plosives
are created by blocking the air flow momentarily at the
mouth, and then releasing it, a kind of “explosion”. This is
well known to recordists for the pros and cons involved. The
down side is that if the explosion of air goes straight into
the microphone it will create a low frequency “pop” which
can be very distracting (but can be filtered out, as shown here). A windscreen can help,
but you can also place the mic about 30 to 40 degrees off
centre. Try placing your hand in front of your mouth and
vocalizing a “pah” sound; then move your hand to the side
until you no longer feel the air being expelled.
The useful side of the plosive for a sound editor is the
fact that there is a short pause before the
consonant – and hence a perfect place to make an edit if it
is needed. The brief attack of the plosive may even
momentarily mask a slight change in ambience that might
otherwise be noticed.
Note that the plosives come in pairs of unvoiced (p,
t, k) and voiced (b, d, g). This voicing is brief,
but if you feel your throat around the Adam’s apple, you can
sense the short vibration involved along with the airstream,
even near the back of the mouth (the velar g).
The pattern recognition mechanism for a plosive can be
triggered merely by inserting silence into a set of
phonemes. Keep this in mind if someone carelessly suggests
that information is “in” a sound or a signal. Just as you
are more likely to notice a continuous sound only after it’s
stopped (since you have been habituated to it previously),
it is patterns of sound and silence that provide
information to the listener. In this simple example, on the
left, silence is inserted into the word “say”, first at 10
ms, then increasing to 20, 40, 60, 80, 100, 120 and 140 ms.
What new “word” do you hear?
Increasing silence introduced into the word "say" (Source: Cook 44)
Increasing silence introduced into the phrase "gray ship" (Source: Cook 47)
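A minimal sketch of this kind of edit, assuming the numpy and soundfile libraries and a hypothetical mono recording say.wav; the split point would normally be chosen by inspecting the waveform for the end of the "s" noise.

```python
import numpy as np
import soundfile as sf

y, sr = sf.read("say.wav")               # hypothetical mono recording of "say"
split = int(0.12 * sr)                   # illustrative split point after the "s"

for ms in (10, 20, 40, 60, 80, 100, 120, 140):
    gap = np.zeros(int(sr * ms / 1000))
    edited = np.concatenate([y[:split], gap, y[split:]])
    sf.write(f"say_{ms}ms.wav", edited, sr)
```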
To add a
bit of further complexity to the example, many people hear
the common word “stay” which includes the alveolar plosive
“t”. However, for animal fanciers, the less common word
“spay” (i.e. to neuter a female animal) may also be heard.
The right-hand example is a rather famous case
of perceptual re-organization in which silences are edited into
the phrase “gray ship” and many variants of the words can
be detected.
As with many similar-sounding consonants, a key
cue that enables us to distinguish them is the shape
of the lips and mouth during their articulation. Most people
do not realize the extent to which they rely on lipreading
while listening to someone speak, although if they are
experiencing hearing impairment, they will start depending
on that skill to a much greater extent.
The plosive “b” involves closing the lips, whereas “d”
involves tongue movement around the gums, followed by the
same act of opening the mouth to expel the air. The “g”
plosive involves the tongue at the back of the mouth against
the soft palate. In a famous experiment called the McGurk
effect, you are asked to listen to some repeated
plosives without seeing the speaker, and then compare what
you’ve heard to seeing the speaker at the same time. Does
the sound you hear change? Try it with this video, which has no visuals at
first, followed by a close-up of the speaker.
Almost everyone hears a “ba ba” in the audio-only track, and
“da da” in the audio-visual version. This is because the
speaker’s lips and mouth are articulating a “d” plosive
while the synchronized soundtrack is a “b” plosive. It is a
good demonstration of how the visual and auditory faculties
interact – you’ll hear it once you see it.
The fricatives, as the name
suggests (as in friction), involve constricting the air
flow with the tongue position as indicated in the
table. Unlike the plosives, knowing which version of the
fricative to use, voiced or unvoiced, is a tricky issue for
foreigners learning English, as you cannot tell from the
spelling of the word in all cases. Possibly the worst is the
“th” fricative in English, first because it requires you to
put your tongue between your teeth – something that would
likely be regarded as rude in other cultures!
Notice the phonetic symbol θ for the unvoiced “th”
to distinguish it from the voiced version “th”. An
unvoiced example is “thin”, with the voiced version as in
“this”. Try saying “this thin” normally and then with the
phonemes reversed! Not easy, but how is one supposed to
learn which is used? Maybe you’ll understand why it
sometimes comes out as “dis” in a foreign accent.
The alveolar and palatal fricatives (ss, sh, ch and their
voiced versions, z and zh) are known as sibilants,
which are characterized by their strong energy in the high
frequency range, namely 5 - 10 kHz. Their presence is
collectively known as sibilance.
When their strength is augmented by close miking, as in
radio, they can be subjected to attenuation, called de-essing.
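A very crude de-esser can be sketched as split-band gain reduction: measure the level in the sibilant band and pull it down when it exceeds a threshold. The sketch below (assuming the scipy and soundfile libraries, a hypothetical mono file, and a sample rate comfortably above 20 kHz) is only an illustration of the idea; real de-essers use smoother envelope detection and programme-dependent attenuation.

```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfilt

y, sr = sf.read("voice.wav")                         # hypothetical close-miked voice

# Isolate the sibilant band (roughly 5-10 kHz, as noted above)
sos = butter(4, [5000, 10000], btype="bandpass", fs=sr, output="sos")
sibilant = sosfilt(sos, y)
rest = y - sibilant                                  # crude complementary signal

frame = int(0.005 * sr)                              # 5 ms analysis frames
threshold = 0.05                                     # illustrative linear level
out = np.copy(sibilant)
for i in range(0, len(y) - frame, frame):
    level = np.sqrt(np.mean(sibilant[i:i + frame] ** 2))
    if level > threshold:
        out[i:i + frame] *= threshold / level        # pull the band back down

sf.write("deessed.wav", rest + out, sr)
```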
Here is a video
demonstration of the four pairs of unvoiced and
voiced fricatives where you can easily see the added
periodicity of the voiced version in the oscilloscope and
the dramatic spectral addition of the lower harmonics in the
spectrogram. These are good examples of pitch plus noise
in a sound. You can probably do the same switching on and
off of the voicing much more rapidly than in the example -
try it!
The glottal fricative “h” is only unvoiced and
produces a soft, airy attack for the next vowel (and can be
silent in some words, such as “hour”), and as such is used
as a neutral consonant before a vowel in the formant
frequency table above. In some languages, it is given more
of a guttural articulation, as in the Dutch “g”.
The semi-vowels (y and w) are
regarded as consonants because they are always linked to the
following vowel with a soft attack. They are both voiced,
with the “y” formed by putting the tongue towards the front,
similar to “ee”, and the “w” is created by rounding the lips
similar to an “oo”.
The two “liquid” consonants, l ("el") and r,
are voiced and usually described as having air flow around
the tongue (hence the fluid name), positioned near the gums
for the “l” and farther back for the “r” which is also
called a rhotic. In some languages and dialects it
is “rolled” which means amplitude
modulated.
The nasal consonants are also voiced, but because
the air is blocked from coming out of the mouth, they are
resonated in the very large nasal cavity (the
largest in the skull) connected to the vocal tract behind
the soft palate. Their strength can be subtle or strongly
pronounced (try emphasizing them, the m, n and ng and feel
the vibration in your head). Since there is no mechanism in
the voice for inharmonic modes of vibration (the
vocal folds always produce harmonics), we often resort to
the nasal consonant “ng” when we want to imitate a metallic
sound such as a bell ringing.
To conclude, the consonants act similarly to the
attack transients in a musical instrument or other
percussive sound. They arrive first at the auditory system,
are spectrally complex in terms of noise bands, and we are
very sensitive to their temporal shape. If they are missing,
masked or otherwise muted through hearing loss, speech
comprehension will quickly decrease.
Admittedly we can sometimes “fill in the blanks”, such as
when the high frequency sibilants are not transmitted over a
phone line, because of the redundancy in speech and the
familiarity of most words. However, many words differ only
in the consonants being used, or become indistinct because
of slurring words together. Nevertheless, speech recognition
reminds us that both spectral and temporal information
are being collected simultaneously in the auditory system,
and taken together they can efficiently convey a great deal of
information.
C. Reading a sonogram. We will now
return to the representation of speech in the sonogram
(or spectrograph), as introduced in the second Vibration module. Since
the 1940s, this visualization of speech has shown the
intricate acoustic structure of speech in a 3-dimensional
representation (frequency and time on the y and x axes,
respectively, with darkness of the lines showing
amplitude). Today there are many more colourful versions
of the same type of representation. The linear frequency
scale in this case is useful to examine the important role
of high frequencies.
Speech sonogram
(source: Denes & Pinson)
In this
example, we are now in a position to discuss some features
of (in this case, slow) connected speech that are missing
from the categorizations used above. We will follow the
various phonemes being represented according to the text at
the top:
- the first word “I” is a diphthong,
and here we can see the characteristic rise in the second
formant towards the “ee”, without it being sustained at
the end
- the
consonant “c” (the plosive “k”) is clearly unvoiced
(no low frequencies) and resides in the 2-5 kHz range,
which distinguishes itself from an “ss” but not
necessarily a “t”
- the
vowel “ah” that follows is a pure vowel, but
notice that its formants glide slowly towards the closing
nasal “n” which required a closed mouth, hence the reason
for the formant shift; however, this does not make
it into a diphthong; the vowel recognition depends only on
the initial formant frequency placement
- the sibilant
“ss” is clearly very high frequency and unvoiced, also
longer, so the only ambiguity would be “ss” or “sh” but in
this case the frequency band is higher, so it’s an “ss”
-
again, another pure vowel “ee” gliding towards a semi-vowel
“y” because the tongue is moving towards the front; the
weak semi-vowel “y” then is just the soft start of the
final vowel which probably in the original ended with the
lips closed, hence the final drop in formants.
As you
can see, speech recognition by a machine has always been a
kind of elusive “holy grail”, because there are so many
patterns to recognize. It is also very difficult to do
without reference to syntactical knowledge, and sensitivity
to the variations in individual voices which usually
requires the user to “train” any algorithm, not to mention
how large a vocabulary is needed. Recently of course,
several successful apps have become available on our
smartphones, which represents a great deal of progress in
artificial intelligence.
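If you would like to generate comparable sonograms from your own recordings before moving on, here is a minimal sketch assuming the scipy, soundfile and matplotlib libraries and a hypothetical mono file phrase.wav. Short analysis windows favour the time resolution needed to see consonants; much longer windows would resolve individual harmonics rather than formants.

```python
import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

y, sr = sf.read("phrase.wav")                # hypothetical mono speech recording

f, t, S = spectrogram(y, fs=sr, window="hann", nperseg=256, noverlap=192)

# Grayscale, dark = strong, to match the classic sonogram convention
plt.pcolormesh(t, f, 10 * np.log10(S + 1e-12), shading="auto", cmap="gray_r")
plt.ylim(0, 8000)                            # the range where speech energy lies
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```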
Filtered speech. Although this
example does not involve a sonogram, we will show a larger
scale spectrogram of a male voice that is filtered into
seven frequency bands, each about two octaves wide, with
centre frequencies that move up one octave each time. The
voice is that of a West Coast indigenous elder, Herb George,
using a mix of English and Indigenous place names in the
area where he lives. For each of the seven bands, try to
identify which parts of the vocal spectrum you are
hearing, and which band(s) make the speech the most
intelligible. The centre frequency of each band is:
(1) 125 Hz   (2) 250 Hz   (3) 500 Hz   (4) 1 kHz   (5) 2 kHz   (6) 4 kHz   (7) 8 kHz
Male voice filtered into seven bands an octave apart
Source: Herb George, WSP Van 113 & 114
Here is
what you are most likely to hear in each band according to
its centre frequency:
(1) 125 Hz. You hear the fundamental pitch of
the voice which identifies it as male, and what is
clearest are his pitch inflections (as discussed in the
next section) and their rhythm
(2) 250
Hz. This band is louder and is mainly the lower formants
of the voice but no consonants. Pitch inflections are
clear, but the words are not intelligible, though you hear
when one is emphasized, and the timbre is muffled
(3) 500
Hz. The voice is brighter in this range and vowels can be
recognized, but there are no consonants
(4) 1
kHz. The speech becomes almost understandable even if
you’re hearing non-English words, as you are getting some
vowel and consonant information
(5) 2
kHz. This band is quite quiet and sounds distant, but this
is usually the one where you can understand the speech the
best if you listen carefully; phrases like “name of a
place there”, “and across Belcarra in the inlet” and his
humorous remark about “they say there were lazy people
there” seem quite clear
(6) 4
kHz. In this band you only hear consonants and very little
vowel information
(7) 8
kHz. There are only sibilants in this band and the sound
is the weakest
From this demonstration you can see why the
telephone bandwidth has to include the 2-3 kHz band.
Note that band 5 in the example, from 1-4 kHz, is
where the ear is most sensitive according to the Equal
Loudness Contours that were determined at the
same time as telephone technology was being developed. This
band includes both upper formants in the vowels and some
consonantal information.
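A rough version of this band-by-band demonstration can be reproduced with a bank of band-pass filters, one per centre frequency. The sketch below assumes the scipy and soundfile libraries and a hypothetical mono file voice.wav; each band is two octaves wide (one octave either side of the centre), as in the example above.

```python
import soundfile as sf
from scipy.signal import butter, sosfilt

y, sr = sf.read("voice.wav")                         # hypothetical mono speech recording

centres = [125, 250, 500, 1000, 2000, 4000, 8000]    # Hz, as listed above

for fc in centres:
    lo, hi = fc / 2.0, fc * 2.0                      # two-octave band around fc
    hi = min(hi, 0.45 * sr)                          # keep the 8 kHz band below Nyquist
    sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
    sf.write(f"band_{fc}Hz.wav", sosfilt(sos, y), sr)
```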
To complement the discussion of consonants in
English, you can appreciate the complexity of these place
names from the same speaker. Note that the first three names
have the standard telephone bandwidth (300 Hz - 3 kHz) and
then open up to full bandwidth which makes the sound louder
and brings it closer.
Consonants in a West Coast Indigenous language
Source: Herb George, WSP Van 113 & 114
D. Voice and soundmaking on a personal
and interpersonal level. Now that we have a grounding
in the acoustics and linguistics of speech, we can switch
levels to consider its larger communicational aspects. A
good place to start is how one’s own soundmaking mediates
the relationship to oneself, to one’s self-image and gender,
and then to the acoustic environment, leading to
interactions with others and larger social groups.
Since sound is a physical phenomenon, our own
soundmaking reflects the whole person, physically and
psychologically. It also reflects the interiority of the
body since all aspects of bodily and mental functioning
influence our ability to make sound. Friends and
acquaintances will easily detect through your voice when
there are changes in your state of health, mood or
personality.
Feedback of our own sounds to the ears, together
with feedback from our surrounding environment via reflections
and resonances (as documented in Sound-Environment
Interaction), plays a role in a basic orientation
of the self within a given space. In that module we
presented this example of a voice recorded in different
acoustic spaces, indoors and outdoors, and it’s worth
repeating here now that we have a more detailed
understanding of vocal spectra and resonances. Listen to how
much the timbre of the voice changes in each location (in
fact, some listeners were surprised to learn it was the same
voice).
Voice recorded in different spaces
from program 1,
Soundscapes of Canada
Aural feedback can also be disrupted
in a variety of ways:
- a loss of acoustic feedback at the extremes
of acoustic space,
anechoic conditions (with minimum reflected sound)
and a diffuse
sound field (with minimum absorption such that
sounds have no direction), or ones with high noise levels
such that you cannot hear your own sounds
-
hearing loss can make it difficult to judge the loudness
and clarity of one’s own speech, as can the condition of autophony
described here
- a
temporary disruption can be experienced with earplugs or
headphones; however, the effect is the opposite in each
case: your own sounds seem louder when heard via bone
conduction (termed occlusion) and therefore
you tend to speak more quietly, whereas with headphones
that block the air conduction route, you tend to speak
more loudly
- if a
significant time delay (about 1/4 second) is introduced
into your speech reaching the ears, for instance, via
headphones, you are likely to stop speaking altogether
(a simple sketch of such a delay appears after this list)
-
similarly, when people hear a recording of their own
voice, they usually don’t think it “sounds like them”,
because they are used to a mixture of bone conduction and
air conduction in the feedback loop; bone conduction
transmits more low frequencies, and therefore their
absence in a recording makes it seem that your voice is
higher than you think it is (would that affect men more
than women?)
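The delayed-feedback condition in the list above is easy to try with a duplex audio stream. A minimal sketch, assuming the sounddevice library, a microphone and headphones (use headphones, or the loop will feed back acoustically):

```python
import numpy as np
import sounddevice as sd

sr = 44100
delay = 0.25                                    # about 1/4 second, as noted above
buf = np.zeros(int(sr * delay), dtype=np.float32)

def callback(indata, outdata, frames, time, status):
    """Append incoming audio to the buffer and play it back `delay` seconds late."""
    global buf
    buf = np.concatenate([buf, indata[:, 0]])
    outdata[:, 0] = buf[:frames]
    buf = buf[frames:]

with sd.Stream(samplerate=sr, channels=1, callback=callback):
    sd.sleep(20000)                             # speak for about 20 seconds
```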
The
Soviet psychologist Lev Vygotsky (1896-1934) researched the
social and cognitive development of children, and one of his
key ideas was that egocentric speech in children
(speaking out loud only to themselves) around the ages of 3
to 4 was a precursor to inner speech and a form of
self-regulation of behaviour. When this kind of speech is
internalized after a few years, it assists listening,
thinking and cognitive development.
Interpersonal interaction and
paralanguage. Non-verbal communication, that is,
without using words, includes bodily and facial movement,
called kinesics, interpersonal distance, called proxemics,
and other sensory forms of behaviour including paralanguage
which is basically how words are spoken.
Paralanguage can be regarded as the analog (i.e.
continuous) aspects of verbal communication, compared with
the discrete, digital form of words. In fact, it is
what links words and phrases together to give them an overall
shape and rhythm. As such, and because of the types of
parameters it includes, it is often regarded as the
“musical” aspects of communication since similar terms apply
to both. In that sense, it describes the form of the
communication, the “how” rather than the “what” of what is
being said.
In terms of hemispheric specialization as discussed
in the Sound-Sound Interaction
module, paralanguage is more likely to be processed in the
right hemisphere, compared to the language centres of the
left hemisphere (which is predominant in right-handed
subjects, but can be either right or left in left-handed
subjects). This means it is processed more as shapes and
contours, rather than discrete logical units. It also means
that it is less vulnerable to noise or distortion, since it
is an overall gestalt-like pattern, which is not as easily
degraded, for instance, by missing a word or two.
Also, unlike the digital form of words, analog communication
cannot be self-referential, paradoxical, or self-negating.
A sentence like this one can self-referentially describe
itself as a statement, or even say that it is untrue. On the
other hand, paralanguage and other non-verbal cues can neutralize
or even negate the linguistic content. For instance, a
phrase with negative or critical implications can be
delivered with the right paralanguage that indicates the
speaker is just teasing or joking. In fact, the real
intent is to reaffirm the speaker’s trust in the person the
remark is aimed at. That is, the speaker needs to know that
it won’t be taken the ”wrong way”, a recognition that the
literal meaning is so far from the actual truth that it can
be joked about.
However, as we will see below, when the paralinguistic
form seems to match the content of the communication,
it will likely be received as genuine and sincere, but
when there is a mismatch, the listener will detect irony,
sarcasm or even misleading intentions and outright
manipulation.
But first, let’s look at the parameters of
paralanguage. Their values can range from normal, to
exaggerated (for emphasis or some effect), to stylized and
ritualistic, as illustrated later.
- pitch: the pattern of pitch
changes is called the vocal inflection, and can
be described as having an average, range and specific
contours; in tonal languages, pitch levels
define the meaning of a word, whereas in Indo-European
languages inflection
clarifies the intent of the communication
- loudness:
also has an average, range and specific contours that
include patterns of stress on key words
- timbre:
the quality and texture of a voice that gives it a
character that is easy to recognize but difficult to
describe except in general terms (e.g. rough, smooth,
raspy, nasal, etc); vocal timbre is sometimes altered
for a special effect or purpose
- rhythm:
again, we can use the musical terms of tempo
(perhaps measured as words per minute, with an average
and range), as well as patterns of stress that
give it a “metre” (the number of beats per phrase)
- articulation:
the quality of being clear and distinct, ranging to
slurred and indistinct, or smoothly connected versus
jerky and disjointed
- non-verbal
elements: hesitations, emotional elements,
laughter, and other gestures that psychologist Peter
Ostwald calls “infantile gestures” because they can be
observed in babies that are pre-verbal
- silence:
perhaps the most important of all, described by Tom
Bruneau as “a concept and process of mind … an imposed
figure on a mentally imposed ground”
In terms
of Bruneau’s approach to silence, if we limit ourselves for
the moment to just psycho-linguistic silences, the
first of his three levels (the other two being interactive
silences and socio-cultural silences), we can introduce four
aspects of silence in speech that link it to cognitive
processing. According to Bruneau (Journal of
Communication, 23, 1973, pp. 17-46):
- silence is imposed by encoders (i.e.
speakers) to create discontinuity and reduce uncertainty
- silence
is imposed by decoders (i.e. listeners) to create “mind
time” for understanding
- it can
occur in “fast time” through horizontal
sequencing with high frequency of occurrence, short
durations and low emotional intensity
- it can
also occur in “slow time” which reflects semantic
and metaphorical processes, which Bruneau describes as
“organizational, categorical, and spatial movement
through levels of memory”; high sensory moments with
high emotional intensity are experienced in slow time
and/or silence
It will be
useful to keep all of these parameters in mind as you listen
to two interview excerpts, both of which have approximately
the same overall slow tempo, but the role of silence (and
paralanguage in general) is vastly different. The two
examples have been chosen to illustrate our hypothesis about
paralinguistic form matching content, and seeming
appropriate to it.
Does it clarify and put the message into context? What does
it reflect and reveal about the speaker and his/her
relationship to the listener? Is it a form of
metacommunication (i.e. communication about a
communication)?
There are two interviews: (1) a grandmother speaks to her
grandson about the sounds she grew up with on a farm; (2) a
retired policeman offers his views on native land rights.
Interview with a grandmother recalling sounds from her past
Interview with a retired policeman about native land rights
In the first interview excerpt, the
grandmother speaks slowly but steadily with a relaxed tempo,
leaving silent sections where she seems to be recalling
memories (and her grandson was wise not to interrupt those).
However, her pace (which has not been distorted by any
editing) maintains a steady beat and tempo where the most
important words land on the beat (try “conducting” the
recording to see how this works). Each sound memory is given
a vivid aural description with paralinguistic imitations of
the sounds being described, such as sleigh bells “ringing
out”, the train whistle on a “frosty night” which
would “ring across the prairie”, the “purring of a cat” and
the “cackle of hens” which she accompanies with laughter
inviting her listener to join in the amusement.
This interview went on for a very long time, as one memory
led to another, including some unexpected machine sounds
that she found memorable. Her pitch inflections are quite
free for each memory, generally starting higher and
descending to a more intimate level. Overall, she draws the
listener into the memory of each experience, and gives a
clear indication of how she felt about each sound. Her use
of silence is a good example of Bruneau’s concept of “slow
time” which invites reflection and metaphor.
Another example of “slow time” speech is Barry Truax’s
performance of John Cage’s “Lecture on Nothing” from
his book Silence where the text is spread out on the
page to indicate approximate tempo and rhythmic contours
(note, the text starts at 3:40).
In the second interview excerpt, the male speaker’s use of
paralanguage is the complete opposite. As can be seen in the
overall spectrum pattern, his phrases are stiff and
mechanical with no flow. The pauses come in syntactically
inappropriate places, e.g. after a single word, not complete
phrases, as he attempts to maintain control of the subject
matter and allow no intervention or dissent. The sing-song
pitch inflections are repetitive in predictable patterns
that can be arbitrarily applied to any subject, and the
stress points are on arbitrary words, such as “exception”,
“two locations”, “misadventure” and “treaty exists”.
The overall impression is that he is trying to sound
objective and logical (as a professional policeman would be
trained to do), but at key moments there are slips in
objectivity that betray what we come to suspect are his true
feelings. Euphemisms such as “enjoying the treaties”,
mispronouncing “publicity”, correcting his mistake about
“illegally taking it away”, ad hominem (and racist)
characterizations such as “rattled their bones in some type
of a war dance”, clichés such as “at this time” and “in
their wisdom”, all suggest he is repeating an official line
that is intended to hide his own feelings.
As a further example of interpersonal dialogue, you may be
interested in this detailed
analysis of four short interchanges between family
members that reveal a series of dynamic shifts in
paralanguage that reflect underlying issues they are
experiencing.
18.
Personal Listening Experiment. Try
listening to the paralanguage of several different
examples of speech, both on a personal and interpersonal
level, and those found in the media. Refocus your
listening away from the actual content of the spoken
message onto the contours of the phrases and their
rhythmic variations. If you want to be more analytical,
you can use the list of parameters above as a checklist.
If you are coming at this from the electroacoustic side,
you may want to record some examples and loop them similar
to the conversation analysis in the previous link. Ask
yourself: what is being communicated that is not directly
in the text? What is being revealed and what remains
hidden? How aware are you of your own use of paralanguage
in typical situations?
E. Soundmaking in cultural contexts.
Voices of power and persuasion.
There are countless examples of vocal styles used publicly
that could be cited in this context, but we are going to
listen to some historical recordings that may be less
familiar to you, as well as other instances of stylized
soundmaking.
We begin with a formal political speech given by the
Canadian Prime Minister, William Lyon Mackenzie King, in
1925 to a Liberal convention in Montreal. He uses the
typical kind of projected voice required to be
heard acoustically in a large gathering (it’s unclear
whether any amplification would have been used at that
time, although it was picked up for radio). It is a
classic example of oratory characterized by a raised voice
(higher pitch and loudness), a slow steady beat, and a
series of inflection patterns that start higher and
descend to a cadence. The regular rhythm works to hold
the attention of the audience (in this case a favourable
one), and provides a framework within which a key word or
phrase can be emphasized.
King’s main point is that he needs to have a majority
government in the next election, and so his logical
argument, that extends over two minutes (no sound bites
here), develops in clear stages. It reaches a high point
about 90 seconds in, with the word “majority” on a higher
pitch. This sequence includes the traditional oral
technique of the “list”, a repeated set of similar phrases
that goes through its items one by one so the listener
experiences the time frame, rather than grouping them
together as a logical set – in this case detailing five
consecutive governments in the UK that eventually resulted
in a majority government. He then goes to the US context
where a third party had been rejected in favour of a
strong majority. Also notice how he “speaks into the
applause” to maintain continuity and the energy level.
W.L.M. King's broadcast speech, 1925
Source: National Library
Media
analysts often point to Franklin Roosevelt’s “fireside
chats” as being one of the first examples of political
speech adapted to radio, that is, in a relaxed manner
similar to speaking with a small group of friends, not a
large audience. In fact, there were only 30 such broadcasts
by the US President between 1933 and 1944, but they had a
great impact on the country during times of crisis.
A Canadian equivalent is less well known, namely the Rev.
William “Bible Bill” Aberhart, premier of Alberta from
1935-43, and founder of the Social Credit Party, who
combined religion with politics, and began broadcasting from
the Calgary Prophetic Bible Institute as early as 1927. He
had a regular program throughout the 1930s that reached a
wide rural area of the province, where he mixed religion
with a radical critique of the banking system and the
federal government.
In this example, broadcast on Aug. 29, 1937, you can hear
his low-key style in front of an audience in Edmonton that
is clearly designed to resemble a “fireside chat” with a
supportive congregation. In this excerpt he reads from two
letters sent by members of his radio audience (“rapidly” he
says, but actually at a slow steady pace). Note the short
phrases with similar inflection patterns and cadences,
particularly in the text noted on the spectrogram.
Rev. William Aberhart broadcast, 1937
Source: National Library
Radio not only produced many
different styles of vocal behaviour, but also created
structures to foreground them, such as the “ad break”. We
presented a detailed analysis of one such break in the Dynamics module where an
announcer brought the listener out of the music and made a
smooth transition into the commercial ad, and then back
into the program content. In this sequence from the 1930s,
a somewhat similar structure was used in a radio drama
program to frame the ad and blur its boundaries.
The framework that is set up is that we’re in the
intermission between two acts of the drama, and that we’ve
gone backstage and can eavesdrop on the two female actors
during the break. The male announcer sets the scene,
coming out of the concluding music, and miraculously we
can hear the actors chatting about the previous scene. It
should be recalled that female voices were not acceptable
as announcers during this period, so an ad sequence with
two female voices was very unusual, though in this case
normalized by their being the same actors as in the play.
Radio voices had to be distinct and expressive
since the actors couldn’t be seen, and so the two female
actors have the stereotypical voices of Sally, the “lady
of the house” (high pitched, smooth timbre, sounding very
educated) and her “maid” Hilda (a broadly defined working
class accent with exaggerated inflections, low pitch and
raspy timbre). They discuss the previous scene in the
garden, eliciting the idea that the maid has a boyfriend,
Henry, whom she wants to impress, but the obstacle she
says is her face.
The lady asks her what soap she is using (the section
shown in the spectrogram), and the maid naively describes
its advertised promises, to which the lady takes on a
didactic stance, offering a medically endorsed (by a male
physician of course) alternative, the Ivory Soap brand.
Then there is more innuendo about Henry, and the announcer
smoothly returns us to the drama cued by “the bell for the
second act”.
Radio drama ad break, Ivory Soap, 1930s
One of the most highly stylized vocal
professions is that of the auctioneer. They are
usually male, and have achieved a high degree of
virtuosity in rapid vocal effects designed to entice the
listener into making a bid on something being sold.
Unlike the previous examples, actual intoning is
used by the auctioneer, that is, placing the words on a
sung pitch which has several advantages. First of all,
the auctioneer can prolong his “patter” much longer than
if he were purely speaking, since less breath is being
expelled. Secondly, this pitch attracts and holds the
listener’s attention and provides a recognizable tonal
centre from which departures can be made, in this case
at the end of each extended breath, the first going up,
the next going down. And thirdly, the sheer pace of the
patter increases the energy level and may induce the
audience’s awe for his virtuosity.
Moreover, this particular auctioneer has mastered the
art of alternating between a rapid rhythmic patter to
hold everyone’s attention, and the “breaks” (which
themselves attract attention) where he cajoles his
target customer with banter. Here his inflections go
much farther up and down than in the patter,
allowing for a free give-and-take plus humorous
interjections. This fellow is highly successful in
building up the bids with each iteration of this pattern
(one of which is shown in the spectrogram), until he
finally gets a high price for what he is selling (in
this case a pumpkin). As a footnote, the previous
unsuccessful auctioneer got a 25 cent bid, and this guy
got $2.50!
Auctioneer
Source: WSP Van 123 take 13
So
far we have only given examples of the solo voice, but
there are many instances where multiple voices
join together, not just in unison (which is also
impressive) but in competition. Sports events are well
known examples of this, but in the next example we get
a more intimate situation, a local softball game where
the team members and fans are particularly vocal about
encouraging their own team members, and intimidating
their opponents.
Paralanguage clearly
dominates semantic content – the form of the
communication is more important, and its goal is to
hype one player (the pitcher or the batter) and to
diss the other. Short punchy repeated phrases are
used, such as “Come on now, like you can … hum in
there” to the pitcher Benny, or “you watch ‘em now
fella” to the batter, Al.
The second example does not have opposing teams, but
has every player vying with everyone else. This is a
historic recording from the trading floor of the
Vancouver Stock Exchange, at a time when bids and
offers had to be shouted out by each trader (now
superseded of course by computers). The example is
remarkable for how different each person had to sound
to be noticed above all the others and the general
din. All of the paralinguistic cues are on display,
pitch inflections, timbre and loudness, in a perfect
symbol of market capitalism. We might even compare
this behaviour to the Acoustic
Niche Hypothesis where each species has
its own frequency band.
Softball game
Source: WSP Van 107 take 3
Trading floor, Vancouver Stock Exchange, 1973
Source: WSP Van 30 take 2
A unique
form of collective soundmaking that few people have
experienced is called glossolalia, commonly called
“speaking in tongues” as practiced by certain evangelical
congregations. The idea is not necessarily having the
ability to speak in a different language (which is the usual
interpretation of the Biblical account of the disciples at
Pentecost being given the power to go forth to other lands
and preach). Most studies find that the vocalization in
glossolalia is made up of phonemes and syllables in the
speaker’s own language, and so it represents a kind of free
vocal improvisation.
In this example, recorded in a church in Vancouver, the
individual voices can be clearly heard overlapping in a
random fashion. We won’t speculate on the religious aspects
of the practice, but simply note that public soundmaking
opportunities are generally regulated in an orderly fashion,
and other than unison behaviour at specific moments, there
are few opportunities to have complete vocal freedom in a
public context that is clearly devoid of any sense of
negative emotion (which leaves out major sports events). The
ability with glossolalia to maintain such vocalizations for
a long period indicates that it is not stressing the body,
but rather providing a safe outlet for self-expression
within a community group.
Group
participation in glossolalia at a Vancouver church
F. Cross-cultural forms of vocal
soundmaking. In many cultures, functional and
musical vocalization practices – which are usually deeply
intertwined – are highly complex and reveal a seemingly
infinite variety of vocal possibilities. Some of these
practices are endangered in the contemporary world, others
have been revived to some extent by enthusiastic
followers, and still others have managed to transfer
themselves through evolution into contemporary
multi-cultural practice.
The functional aspects of the various types of soundmaking
usually reflect and utilize the typical acoustic spaces they are
found in, and have clearly evolved in relation to them to
the point where they can be called languages. For
instance, the whistling languages found in many
parts of the world, the best known being in the Canary
Islands, are designed to communicate across mountainous
valleys since high frequencies will carry well in
the absence of absorbing obstacles, similar to the Swedish
examples of kulning presented below.
African drumming languages, often referred to as
talking drums, can also communicate over long distances
through the jungle and be relayed even farther. They can
imitate the tonal patterns of spoken language
through the resonances of the log drums. Of the countless
other examples we could draw on, here are just four
traditions from other cultures that we can briefly
introduce you to, in the hope that they will inspire you
to investigate them and others more fully.
Inuit. The traditional competitive form of
Inuit throat singing, known as katajjaq,
in the Canadian high arctic, was usually practiced by two
women facing each other at a very close distance,
sometimes close enough to use their partner’s mouth as a
resonator. They took turns uttering complex rhythmic
sounds on both the inhale and exhale of the breath,
filling in the gaps in the other’s sound. These sounds
were sometimes intended to imitate those of nature, or
even machines in more recent times. It was performed by
both adult women and young girls who could adapt the
practice to exchange gossip about their boyfriends, for
instance. The development of such diaphragmatic breathing
was probably beneficial for a cold climate. The cycle
ended when one ran out of breath, and laughter erupted as
in this recording.
Inuit throat singing
Sweden. In the north of Sweden, the
women who tended small herds of cows and sheep in the
mountainous regions developed an amazing singing style
called kulning that was partly functional – to call
their animals and to communicate with others across a
valley, for instance – and highly musical with extremely
high pitches and precise intervals produced with what is
called head voice. In some examples, the inflections
of speech directed at the animals flowed smoothly into these
sung pitch patterns as in the first example. In others, the
melodic lines are entirely sung with intricate ornamentation
as in the second example, perhaps reflecting the isolation
being experienced.
These are just two examples of those recorded by the Swedish
musician Bengt Hambraeus in the late 1940s and early 50s,
transmitted by a telephone line back to Stockholm. In some
recordings you can hear an echo from across the valley. With
no barrier to absorb the sound, these high pitches actually
carried farther than low-pitch sounds. The tradition
gradually disappeared from actual herding practice, but has
recently been revived by younger singers studying the
tradition.
Swedish mountain shepherdess calls to her animals
Swedish mountain shepherdess singing
Source: Bengt Hambraeus
Pygmy. The various Pygmy cultures in
sub-Saharan Africa, such as the Aka and Baka peoples, have
become well-known for their contrapuntal music and relaxed
vocal style. The first recording with two young girls shows
that even the children can master the interweaving of two
complex vocal lines, sung on the breath, each with very
precise musical intervals. The second example is a group of
men and boys who have returned from the hunt, and shows an
interlocking rhythmic style where each person contributes
their own line according to their ability. Some commentators
such as Colin Turnbull and Alan Lomax have suggested a
correlation between this musical style and their
traditional, communal lifestyle. The French
ethnomusicologist, Simha Arom, has published many sound
recordings such as these.
Two pygmy girls singing in counterpoint
Pygmy men and boys singing after the hunt
Source: Simha Arom
Tuvan. The nomadic Tuvan people
of Central Asia developed a form of throat singing that
can produce overtones and multiphonics, that is,
overtones so strong they can be heard as multiple pitches.
The high pitches that you hear in this brief example are
produced by a manipulation of the tongue at the front of
the mouth at the same time as the throat is being
constricted at the back. As with the Inuit, circular
breathing allows the sound to be sustained for long
periods of time.
Tuvan throat singing
Last sequence of the recording
Q. Try this review quiz to
test your comprehension of the above material,
and perhaps to clarify some distinctions you may
have missed.